
Add snippet for impacts on parquet file of data type and compression algorithm #26

Open
wants to merge 1 commit into base: main

Conversation

@ThAccart (Contributor) commented Nov 8, 2024

Not 100% sure this leads to the point we want to demonstrate.
Could you have a look at it?

…ata, depending on chosen compression algorithm and chosen data type.
// COMMAND ----------

/*
This snippet shows who data type for numerical information and compression can affect Spark.

Typo: how.

This snippet shows who data type for numerical information and compression can affect Spark.
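To see concretely why the numeric type matters for compression, here is a hedged, Spark-free sketch using `java.util.zip.Deflater`. The value range and sizes are illustrative assumptions, not measurements from the notebook:

```scala
import java.nio.ByteBuffer
import java.util.zip.Deflater

// Deflate a byte array and return the compressed size.
def deflate(bytes: Array[Byte]): Int = {
  val d = new Deflater()
  d.setInput(bytes)
  d.finish()
  val out = new Array[Byte](bytes.length + 64)
  var total = 0
  while (!d.finished()) total += d.deflate(out)
  d.end()
  total
}

// The same numeric values, encoded once as 4-byte Ints and once as 8-byte Doubles.
val values = 0 until 10000
val asInts = {
  val buf = ByteBuffer.allocate(values.length * 4)
  values.foreach(buf.putInt)
  buf.array()
}
val asDoubles = {
  val buf = ByteBuffer.allocate(values.length * 8)
  values.foreach(v => buf.putDouble(v.toDouble))
  buf.array()
}

println(s"raw sizes:      Int=${asInts.length} B, Double=${asDoubles.length} B")
println(s"deflated sizes: Int=${deflate(asInts)} B, Double=${deflate(asDoubles)} B")
```

The wider encoding starts from twice the raw bytes, and the compressor does not fully recover the difference, which is the effect the snippet demonstrates on Parquet files.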

# Symptom
Storage needs does not match with expectations, for example is higher in output after filtering than in input.

Typo, needs do not match....


I'd suggest for example volume is higher.

Storage needs does not match with expectations, for example is higher in output after filtering than in input.

# Explanation
There is difference in in type when reading the data and type when writing it, causing a loss of compression perfomance.

Typo, double in.


// COMMAND ----------

// We are going to demonstrate our purpose by converting the same numerica data into different types,

Typo, numerical.

// and write it in parquet using different compression


// Here are the type we want to compare

Typo: types.

.option("compression", parquetCompressionName)
.format("parquet")
.mode("overwrite")
.save(fileName)

I'd reuse the mechanism for creating random and disposable temporary directories that is already present in the other notebooks. This way temporary files land in the predefined and already set up trash directory. See the other snippets for reference (they use a UUID; you can use the same imports and snippet).
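A minimal sketch of that suggestion, assuming a trash directory under the system temp dir (the real notebooks may use a different base path; `"parquet-compression-demo"` is a hypothetical name):

```scala
import java.nio.file.{Files, Paths}
import java.util.UUID

// Base "trash" directory; stand-in for the repository's predefined trash location.
val trashBase = Paths.get(System.getProperty("java.io.tmpdir"), "trash")
Files.createDirectories(trashBase)

// Each run writes under its own random, disposable subdirectory.
val runDir = trashBase.resolve(UUID.randomUUID().toString)
Files.createDirectories(runDir)

val fileName = runDir.resolve("parquet-compression-demo").toString
println(s"writing under: $fileName")
```

Passing `fileName` to `.save(...)` then keeps every run's output isolated and easy to clean up.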

.mapValues(_.map(_._1).sum).toSeq.sortBy(_._2)

println("part* files sizes (in kB):")
sizeOnDisk.foreach( o=>println ( s"${o._1}\t${o._2}\tkB"))

Using implicits, you can use toDF to display a more friendly table:
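The sizing pipeline quoted above can be exercised on plain Scala collections; the `(sizeKb, codec)` pairs below are made-up sample data, not real measurements:

```scala
// Hypothetical (sizeKb, codec) pairs, shaped like the notebook's part-file listing.
val partFiles: Seq[(Long, String)] = Seq(
  (120L, "snappy"), (80L, "snappy"),
  (60L, "gzip"), (45L, "gzip"),
  (300L, "none"), (250L, "none")
)

// Same pipeline as the snippet: group by codec, sum sizes, sort by total size.
val sizeOnDisk: Seq[(String, Long)] =
  partFiles.groupBy(_._2).mapValues(_.map(_._1).sum).toSeq.sortBy(_._2)

println("part* files sizes (in kB):")
sizeOnDisk.foreach(o => println(s"${o._1}\t${o._2}\tkB"))
```

In a Spark notebook, `sizeOnDisk.toDF("codec", "sizeKb")` (via `spark.implicits._`) would render the same data as a table, per the reviewer's suggestion.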


// COMMAND ----------

// now we can also add a check on the effect of choosing a specific number type on the obtained values

I'd rephrase this as follows:

// Now let's see how long it takes to read and process such data when using different compression and numerical data formats.
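A Spark-free illustration of that point: decompression cost is part of read time. This hedged sketch times inflating a deflated payload; the payload and timings are illustrative and machine-dependent:

```scala
import java.util.zip.{Deflater, Inflater}

// Compress a byte array fully.
def deflateBytes(bytes: Array[Byte]): Array[Byte] = {
  val d = new Deflater()
  d.setInput(bytes)
  d.finish()
  val out = new java.io.ByteArrayOutputStream()
  val buf = new Array[Byte](4096)
  while (!d.finished()) out.write(buf, 0, d.deflate(buf))
  d.end()
  out.toByteArray
}

// Decompress back into a buffer of the known original size.
def inflateBytes(bytes: Array[Byte], originalSize: Int): Array[Byte] = {
  val inf = new Inflater()
  inf.setInput(bytes)
  val out = new Array[Byte](originalSize)
  var n = 0
  while (!inf.finished()) n += inf.inflate(out, n, out.length - n)
  inf.end()
  out
}

val payload = Array.tabulate[Byte](1 << 20)(i => (i % 7).toByte) // 1 MiB, very compressible
val packed = deflateBytes(payload)

val t0 = System.nanoTime()
val restored = inflateBytes(packed, payload.length)
val elapsedMs = (System.nanoTime() - t0) / 1e6
println(f"inflated ${restored.length} bytes in $elapsedMs%.2f ms (compressed: ${packed.length} bytes)")
```

Heavier codecs trade smaller files for more of this CPU work on every read, which is the trade-off the timing cell measures end to end.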


// COMMAND ----------

// Now you should also check how the variety of values affects compression :

I'd encourage you to add some conclusions here so that people reading the snippet can take action and understand what to expect with their change.
